visual condition
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Asia > India > Uttar Pradesh (0.04)
- Asia > Cambodia (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.71)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.68)
SemanticControl: A Training-Free Approach for Handling Loosely Aligned Visual Conditions in ControlNet
Joung, Woosung, Chae, Daewon, Kim, Jinkyu
ControlNet has enabled detailed spatial control in text-to-image diffusion models by incorporating additional visual conditions such as depth or edge maps. However, its effectiveness heavily depends on the availability of visual conditions that are precisely aligned with the generation goal specified by the text prompt, a requirement that often fails in practice, especially for uncommon or imaginative scenes. For example, generating an image of a cat cooking in a specific pose may be infeasible due to the lack of suitable visual conditions. In contrast, structurally similar cues can often be found in more common settings; for instance, poses of humans cooking are widely available and can serve as rough visual guides. Unfortunately, existing ControlNet models struggle to use such loosely aligned visual conditions, often resulting in low text fidelity or visual artifacts. To address this limitation, we propose SemanticControl, a training-free method for effectively leveraging misaligned but semantically relevant visual conditions. Our approach adaptively suppresses the influence of the visual condition where it conflicts with the prompt, while strengthening guidance from the text. The key idea is to first run an auxiliary denoising process using a surrogate prompt aligned with the visual condition (e.g., "a human playing guitar" for a human pose condition) to extract informative attention masks, and then utilize these masks during the denoising of the actual target prompt (e.g., "a cat playing guitar"). Experimental results demonstrate that our method improves performance under loosely aligned visual conditions across various modalities, including depth maps, edge maps, and human skeletons, outperforming existing baselines. Our code is available at https://mung3477.github.io/semantic-control.
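A minimal sketch of the two-pass procedure described above is given below. The model calls (unet_forward, controlnet_forward) and the mask threshold are hypothetical placeholders, so only the control flow, not a real diffusion backbone, is illustrated.

```python
# Sketch of mask-guided reuse of a loosely aligned visual condition.
# All model calls are stand-ins; only the two-pass control flow mirrors the method.
import torch

def unet_forward(latent, prompt_emb, control_residual=None):
    # Placeholder denoiser: returns a fake noise prediction and attention maps.
    noise_pred = torch.randn_like(latent)
    attn_maps = torch.rand(latent.shape[0], 1, latent.shape[2], latent.shape[3])
    return noise_pred, attn_maps

def controlnet_forward(latent, condition, prompt_emb):
    # Placeholder ControlNet branch: residual features injected into the U-Net.
    return torch.randn_like(latent)

def generate(target_emb, surrogate_emb, condition, steps=50, thresh=0.5):
    latent = torch.randn(1, 4, 64, 64)
    aux_latent = latent.clone()
    for t in range(steps):
        # Pass 1: auxiliary denoising with the surrogate prompt that matches the
        # visual condition (e.g. "a human playing guitar"); collect cross-attention
        # maps that localize where the condition is reliable.
        residual = controlnet_forward(aux_latent, condition, surrogate_emb)
        aux_noise, attn = unet_forward(aux_latent, surrogate_emb, residual)
        mask = (attn > thresh).float()   # 1 where condition agrees with the prompt

        # Pass 2: denoise the actual target prompt (e.g. "a cat playing guitar"),
        # keeping the ControlNet residual only inside the mask and letting the
        # text guidance dominate elsewhere.
        residual = controlnet_forward(latent, condition, target_emb) * mask
        noise, _ = unet_forward(latent, target_emb, residual)

        aux_latent = aux_latent - 0.01 * aux_noise   # toy update in place of a scheduler
        latent = latent - 0.01 * noise
    return latent
```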
- North America > United States > Michigan > Washtenaw County > Ann Arbor (0.14)
- Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
- Asia > South Korea > Seoul > Seoul (0.04)
- Asia > China > Heilongjiang Province > Daqing (0.04)
AQUA-SLAM: Tightly-Coupled Underwater Acoustic-Visual-Inertial SLAM with Sensor Calibration
Xu, Shida, Zhang, Kaicheng, Wang, Sen
Underwater environments pose significant challenges for visual Simultaneous Localization and Mapping (SLAM) systems due to limited visibility, inadequate illumination, and sporadic loss of structural features in images. Addressing these challenges, this paper introduces a novel, tightly-coupled Acoustic-Visual-Inertial SLAM approach, termed AQUA-SLAM, to fuse a Doppler Velocity Log (DVL), a stereo camera, and an Inertial Measurement Unit (IMU) within a graph optimization framework. The proposed system will be made open-source for the community.
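To make the sensor-fusion idea concrete, here is a toy sketch of a graph-optimization cost that jointly constrains keyframe states with DVL, visual, and IMU terms. The residual models are deliberately simplified assumptions (no orientation, no IMU preintegration, no noise models) and do not reflect the actual AQUA-SLAM factors.

```python
# Schematic pose-graph cost fusing DVL, stereo, and IMU constraints.
# A real system would use preintegrated IMU factors and a solver such as
# GTSAM or Ceres rather than this toy cost function.
import numpy as np
from dataclasses import dataclass

@dataclass
class State:
    position: np.ndarray   # 3D position of the vehicle at one keyframe
    velocity: np.ndarray   # 3D body velocity

@dataclass
class DvlFactor:            # Doppler Velocity Log: measures body velocity
    idx: int
    measured_velocity: np.ndarray

@dataclass
class VisualFactor:         # stereo odometry: relative translation between keyframes
    idx_a: int
    idx_b: int
    measured_delta: np.ndarray

@dataclass
class ImuFactor:            # simplified: predicts position change from velocity * dt
    idx_a: int
    idx_b: int
    dt: float

def total_cost(states, dvl, visual, imu):
    cost = 0.0
    for f in dvl:
        cost += np.sum((states[f.idx].velocity - f.measured_velocity) ** 2)
    for f in visual:
        delta = states[f.idx_b].position - states[f.idx_a].position
        cost += np.sum((delta - f.measured_delta) ** 2)
    for f in imu:
        predicted = states[f.idx_a].position + states[f.idx_a].velocity * f.dt
        cost += np.sum((predicted - states[f.idx_b].position) ** 2)
    return cost

# Toy example: two keyframes constrained jointly by all three sensors.
states = [State(np.zeros(3), np.array([0.5, 0.0, 0.0])),
          State(np.array([0.5, 0.0, 0.0]), np.array([0.5, 0.0, 0.0]))]
print(total_cost(states,
                 dvl=[DvlFactor(0, np.array([0.5, 0.0, 0.0]))],
                 visual=[VisualFactor(0, 1, np.array([0.5, 0.0, 0.0]))],
                 imu=[ImuFactor(0, 1, dt=1.0)]))
```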
- Europe > North Sea (0.04)
- Atlantic Ocean > North Atlantic Ocean > North Sea (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- Europe > Greece > Ionian Islands > Corfu (0.04)
VLM-driven Behavior Tree for Context-aware Task Planning
Wake, Naoki, Kanehira, Atsushi, Takamatsu, Jun, Sasabuchi, Kazuhiro, Ikeuchi, Katsushi
The use of Large Language Models (LLMs) for generating Behavior Trees (BTs) has recently gained attention in the robotics community, yet remains in its early stages of development. In this paper, we propose a novel framework that leverages Vision-Language Models (VLMs) to interactively generate and edit BTs that address visual conditions, enabling context-aware robot operations in visually complex environments. A key feature of our approach lies in the conditional control through self-prompted visual conditions. Specifically, the VLM generates BTs with visual condition nodes, where conditions are expressed as free-form text. Another VLM process integrates the text into its prompt and evaluates the conditions against real-world images during robot execution. We validated our framework in a real-world cafe scenario, demonstrating both its feasibility and limitations.
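The following sketch illustrates the core data structure implied by the abstract: a behavior tree whose condition nodes carry free-form text that a VLM checks against the current image at tick time. The query_vlm stub and node classes are illustrative assumptions, not the authors' framework.

```python
# Behavior tree with VLM-evaluated visual condition nodes (illustrative only).
from dataclasses import dataclass, field
from typing import List

def query_vlm(image, condition_text: str) -> bool:
    # Hypothetical: ask a VLM "does this image satisfy <condition_text>?"
    # Replaced by a stub so the sketch runs without a model.
    return "cup" in condition_text

@dataclass
class VisualCondition:
    text: str                       # e.g. "a cup is present on the counter"
    def tick(self, image) -> bool:
        return query_vlm(image, self.text)

@dataclass
class Action:
    name: str
    def tick(self, image) -> bool:
        print(f"executing: {self.name}")
        return True

@dataclass
class Sequence:                      # succeeds only if every child succeeds in order
    children: List = field(default_factory=list)
    def tick(self, image) -> bool:
        return all(child.tick(image) for child in self.children)

# In the paper's setting the tree itself is generated by a VLM; here it is fixed.
tree = Sequence([VisualCondition("a cup is present on the counter"),
                 Action("pick up the cup"),
                 Action("place the cup on the tray")])
print(tree.tick(image=None))
```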
- Asia > Japan > Shikoku > Kagawa Prefecture > Takamatsu (0.05)
- North America > United States > Washington > King County > Redmond (0.04)
SOWing Information: Cultivating Contextual Coherence with MLLMs in Image Generation
Pei, Yuhan, Wang, Ruoyu, Yang, Yongqi, Zhu, Ye, Russakovsky, Olga, Wu, Yu
Originating from the diffusion phenomenon in physics, which describes the random movement and collisions of particles, diffusion generative models simulate a random walk in the data space along the denoising trajectory. This allows information to diffuse across regions, yielding harmonious outcomes. However, the chaotic and disordered nature of information diffusion in diffusion models often results in undesired interference between image regions, causing degraded detail preservation and contextual inconsistency. In this work, we address these challenges by reframing disordered diffusion as a powerful tool for text-vision-to-image generation (TV2I) tasks, achieving pixel-level condition fidelity while maintaining visual and semantic coherence throughout the image. We first introduce Cyclic One-Way Diffusion (COW), which provides an efficient unidirectional diffusion framework for precise information transfer while minimizing disruptive interference. Building on COW, we further propose Selective One-Way Diffusion (SOW), which utilizes Multimodal Large Language Models (MLLMs) to clarify the semantic and spatial relationships within the image. Based on these insights, SOW leverages attention mechanisms to dynamically regulate the direction and intensity of diffusion according to contextual relationships. Extensive experiments demonstrate the untapped potential of controlled information diffusion, offering a path to more adaptive and versatile generative models in a learning-free manner.
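A rough, learning-free sketch of the one-way injection idea follows: the condition region is repeatedly re-seeded from a re-noised copy of the condition latent, so information flows outward from that region but not back into it. The denoiser, noise schedule, and fixed binary mask are assumed placeholders; the paper's MLLM-derived, attention-modulated masks are not reproduced here.

```python
# Toy sketch of cyclic one-way injection of a pixel-level condition.
import torch

def toy_denoise(latent, prompt_emb):
    # Placeholder for one denoising step of a text-conditioned diffusion model.
    return latent - 0.01 * torch.randn_like(latent)

def cyclic_one_way(cond_latent, mask, prompt_emb, steps=50):
    latent = torch.randn_like(cond_latent)
    for t in range(steps):
        noise_level = 1.0 - t / steps                       # assumed linear schedule
        noised_cond = cond_latent + noise_level * torch.randn_like(cond_latent)
        # One-way injection: the condition region is reset from the condition,
        # never from the surrounding image content.
        latent = mask * noised_cond + (1 - mask) * latent
        latent = toy_denoise(latent, prompt_emb)
    return latent

cond = torch.zeros(1, 4, 64, 64)                            # stand-in condition latent
mask = torch.zeros(1, 1, 64, 64); mask[..., 16:48, 16:48] = 1.0
out = cyclic_one_way(cond, mask, prompt_emb=None)
print(out.shape)
```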
- Asia > China > Hubei Province > Wuhan (0.05)
- Oceania > Australia > New South Wales > Sydney (0.04)
- North America > United States > New Jersey > Mercer County > Princeton (0.04)
- (3 more...)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild
Qin, Can, Zhang, Shu, Yu, Ning, Feng, Yihao, Yang, Xinyi, Zhou, Yingbo, Wang, Huan, Niebles, Juan Carlos, Xiong, Caiming, Savarese, Silvio, Ermon, Stefano, Fu, Yun, Xu, Ran
Achieving machine autonomy and human control often represent divergent objectives in the design of interactive AI systems. Visual generative foundation models such as Stable Diffusion show promise in navigating these goals, especially when prompted with arbitrary languages. However, they often fall short in generating images with spatial, structural, or geometric controls. The integration of such controls, which can accommodate various visual conditions in a single unified model, remains an unaddressed challenge. In response, we introduce UniControl, a new generative foundation model that consolidates a wide array of controllable condition-to-image (C2I) tasks within a singular framework, while still allowing for arbitrary language prompts. UniControl enables pixel-level-precise image generation, where visual conditions primarily influence the generated structures and language prompts guide the style and context. To equip UniControl with the capacity to handle diverse visual conditions, we augment pretrained text-to-image diffusion models and introduce a task-aware HyperNet to modulate the diffusion models, enabling the adaptation to different C2I tasks simultaneously. Trained on nine unique C2I tasks, UniControl demonstrates impressive zero-shot generation abilities with unseen visual conditions. Experimental results show that UniControl often surpasses the performance of single-task-controlled methods of comparable model sizes.
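As a concrete illustration of task-aware modulation, the sketch below has a small HyperNet emit FiLM-style scale and shift parameters from a task embedding and apply them inside a shared condition encoder. Layer sizes and the modulation form are assumptions made for illustration, not the released UniControl architecture.

```python
# Task-aware HyperNet modulating a shared condition encoder (illustrative sketch).
import torch
import torch.nn as nn

class TaskHyperNet(nn.Module):
    def __init__(self, task_dim=64, channels=128):
        super().__init__()
        self.to_scale_shift = nn.Linear(task_dim, 2 * channels)

    def forward(self, task_emb):
        scale, shift = self.to_scale_shift(task_emb).chunk(2, dim=-1)
        return scale, shift

class ConditionEncoder(nn.Module):
    def __init__(self, channels=128):
        super().__init__()
        self.conv = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        self.hyper = TaskHyperNet(channels=channels)

    def forward(self, condition_image, task_emb):
        feat = self.conv(condition_image)
        scale, shift = self.hyper(task_emb)
        # Task-specific modulation of shared features (FiLM-style).
        return feat * (1 + scale[:, :, None, None]) + shift[:, :, None, None]

encoder = ConditionEncoder()
depth_map = torch.randn(1, 3, 64, 64)      # one of several C2I condition types
task_emb = torch.randn(1, 64)              # learned embedding for "depth-to-image"
print(encoder(depth_map, task_emb).shape)  # torch.Size([1, 128, 64, 64])
```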
- Europe > Switzerland > Zürich > Zürich (0.14)
- Asia > India > Uttar Pradesh (0.04)
- North America > United States > Utah > Grand County (0.04)
- (5 more...)
- Information Technology (0.46)
- Leisure & Entertainment (0.46)
NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation
Yin, Shengming, Wu, Chenfei, Yang, Huan, Wang, Jianfeng, Wang, Xiaodong, Ni, Minheng, Yang, Zhengyuan, Li, Linjie, Liu, Shuguang, Yang, Fan, Fu, Jianlong, Gong, Ming, Wang, Lijuan, Liu, Zicheng, Li, Houqiang, Duan, Nan
In this paper, we propose NUWA-XL, a novel Diffusion over Diffusion architecture for eXtremely Long video generation. Most current work generates long videos segment by segment sequentially, which normally leads to a gap between training on short videos and inferring long videos, and such sequential generation is inefficient. Instead, our approach adopts a "coarse-to-fine" process, in which the video can be generated in parallel at the same granularity. A global diffusion model is applied to generate the keyframes across the entire time range, and then local diffusion models recursively fill in the content between nearby frames. This simple yet effective strategy allows us to directly train on long videos (3376 frames) to reduce the training-inference gap, and makes it possible to generate all segments in parallel. To evaluate our model, we build FlintstonesHD, a new benchmark for long video generation. Experiments show that our model not only generates high-quality long videos with both global and local coherence, but also decreases the average inference time from 7.55 min to 26 s (by 94.26%) under the same hardware setting when generating 1024 frames. The homepage is https://msra-nuwa.azurewebsites.net/
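The coarse-to-fine schedule can be pictured with the following toy sketch, in which both diffusion models are replaced by string-producing stubs: a global pass places keyframes over the full range, then local passes recursively fill each segment (and could run in parallel). Frame spacing and recursion depth are illustrative assumptions, not the paper's exact configuration.

```python
# Toy sketch of a "diffusion over diffusion" coarse-to-fine generation schedule.
def global_model(prompt, times):
    # Stub for the global diffusion model: keyframes across the full range.
    return {t: f"key({t})" for t in times}

def local_model(prompt, frame_a, frame_b, times):
    # Stub for a local diffusion model: frames between two existing frames.
    return {t: f"fill({t})" for t in times}

def spaced(t_start, t_end, n):
    step = (t_end - t_start) / (n - 1)
    return [round(t_start + i * step) for i in range(n)]

def fill_between(prompt, video, t_a, t_b, n, depth):
    if depth == 0 or t_b - t_a <= 1:
        return
    times = spaced(t_a, t_b, n)
    video.update(local_model(prompt, video[t_a], video[t_b], times[1:-1]))
    for a, b in zip(times, times[1:]):              # each segment is independent
        fill_between(prompt, video, a, b, n, depth - 1)

def generate(prompt, total_frames=1024, n=5, depth=3):
    times = spaced(0, total_frames - 1, n)
    video = global_model(prompt, times)             # coarse pass over the whole range
    for a, b in zip(times, times[1:]):
        fill_between(prompt, video, a, b, n, depth) # fine passes, recursively
    return video

frames = generate("The Flintstones, Fred walks into the kitchen")
print(len(frames), "frames generated")
```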
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Asia > China (0.04)